Adaptation to New Microphones Using Tied-Mixture Normalization
Authors
Abstract
Cambridge, MA 02138; † Northeastern University, Boston, MA 02115

In this paper, we present several approaches designed to increase the robustness of BYBLOS, the BBN continuous speech recognition system. We address the problem of increased degradation in performance when there is a mismatch in the characteristics of the training and the test microphones. We introduce a new supervised adaptation algorithm that computes a transformation from the training microphone codebook to that of a new microphone, given some information about the new microphone. Results are reported for the development and evaluation test sets of the 1993 ARPA CSR Spoke 6 WSJ task, which consist of speech recorded with two alternate microphones, a stand-mount and a telephone microphone. The proposed algorithm improves the performance of the system when tested with the stand-mount microphone by reducing the difference in error rate between the high-quality training microphone and the alternate stand-mount microphone recordings by a factor of 2. Several results are presented for the telephone speech, leading to important conclusions: a) the performance on telephone speech is dramatically improved by simply retraining the system on the high-quality training data after they have been bandlimited to the telephone bandwidth; and b) additional training data recorded with the high-quality microphone give further substantial improvement in performance.

1. INTRODUCTION

Interactive speech recognition systems are usually trained on substantial amounts of speech data collected with a high-quality close-talking microphone. During recognition, these systems require the same type of microphone to be used in order to achieve their standard accuracy. This is a highly restricting condition for practical applications of speech recognition systems.
One can imagine situations where it would be desirable to use a different microphone for recognition than the one with which the training speech was collected. For example, some users may not want to wear a head-mounted microphone. Others may not want to pay for a high-quality microphone. Additionally, many applications involve recognition of speech over telephone lines and telephone sets with high variability in quality and characteristics. However, we know that even highly accurate speech recognition systems perform very poorly when they are tested with microphones whose characteristics differ from those of the microphones they were trained on [1].

There is a wide range of approaches to compensate for this degradation in performance, including:

• Retrain the HMMs with data collected with the new microphone encountered during the recognition stage, a rather expensive approach for real applications, or train on a large number of microphones in the hope that the system will obtain the necessary robustness.
• Use robust signal processing algorithms.
• Develop a feature transformation that maps the alternate microphone data to training microphone data.
• Use statistical methods in order to adapt the parameters of the acoustic models.

In previous work we discussed the use of Cepstrum Mean Subtraction and the RASTA algorithm as two simple signal processing algorithms to compensate for the degradation caused by an alternate channel [7]. In this paper, we present an approach towards feature mapping by modeling the difference between the test and the training microphone, prior to recognition. We have developed the Tied-Mixture Normalization Algorithm, a technique for adaptation to a new microphone based on modifying the continuous densities in a tied-mixture HMM system, using a relatively small amount of stereo training speech. This method is presented in detail in Section 2.
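As a rough illustration of why cepstrum mean subtraction helps with a channel mismatch: a fixed linear channel shows up approximately as a constant offset added to every cepstral vector, so subtracting the per-utterance mean removes it. The sketch below is the textbook form of the idea, not the implementation used in [7]; the function name and the toy data are hypothetical.

```python
from typing import List


def cepstral_mean_subtraction(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-utterance mean of each cepstral coefficient.

    `frames` is a list of cepstral vectors (one per frame), all of the
    same dimension. Returns a new list; the input is left untouched.
    """
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    # Mean of each coefficient over the whole utterance.
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


# Toy data (hypothetical): the same utterance seen through two channels,
# where the second channel adds a constant cepstral offset to every frame.
clean = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0]]
offset = [0.5, -1.5]
shifted = [[c + o for c, o in zip(frame, offset)] for frame in clean]

normalized_clean = cepstral_mean_subtraction(clean)
normalized_shifted = cepstral_mean_subtraction(shifted)
```

After normalization the two versions coincide up to rounding, which is the sense in which the method compensates for a stationary channel. It does not address additive noise or a channel that removes part of the band, such as a telephone channel.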
In Section 3 we describe several experiments on a known microphone task and the effect of the adaptation method on the performance of the recognition system.

2. TIED MIXTURE NORMALIZATION

In a Tied-Mixture Hidden Markov Model (TM-HMM) system [2, 6], speech is represented using an ensemble of Gaussian mixture densities. Every frame of speech is represented as a Gaussian mixture model. Specifically, the probability density function for an observation conditioned on the HMM state is expressed as:
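The equation itself is not reproduced in this excerpt. For orientation only, the form commonly given for a tied-mixture density in the literature is sketched below. The notation is assumed and may differ from the authors': $x_t$ is the observation at frame $t$, $s$ the HMM state, $K$ the codebook size, $c_{sk}$ the state-dependent mixture weights, and $\mu_k$, $\Sigma_k$ the mean and covariance of the $k$-th codebook Gaussian, which is shared by all states.

```latex
b_s(x_t) \;=\; \sum_{k=1}^{K} c_{sk}\, \mathcal{N}\!\left(x_t;\, \mu_k,\, \Sigma_k\right),
\qquad c_{sk} \ge 0, \quad \sum_{k=1}^{K} c_{sk} = 1 .
```

In this form only the weights $c_{sk}$ depend on the state. The Gaussians form a single shared codebook, which is consistent with the paper's description of adaptation as a transformation of the training-microphone codebook.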